pandas Data Analysis: 100 Practice Exercises, Part 2

August 15, 2018

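Contents

1. How to convert a series of date strings to datetimes
2. How to extract the year/month/day/hour/minute/second from a datetime series
3. Find the words in a series that contain two or more vowels
4. How to filter the valid email addresses in a series
5. Group series A by series B, then compute the mean of each group
6. How to compute the Euclidean distance between two series
7. How to find all local maxima (peaks) in a numeric series
8. How to create a TimeSeries of 10 Saturdays starting from '2000-01-02'
9. How to fill in the missing dates of a TimeSeries
10. How to compute the autocorrelation of a series
11. Read only every n-th row when reading a csv
12. Transform data while reading a csv
13. Read only certain columns when reading a csv
14. Get the data type of each column of a dataframe
15. Get the number of rows and columns of a dataframe
16. Get basic descriptive statistics for each column of a dataframe
17. Find the row of a dataframe where column a is at its maximum
18. Get the row numbers where column c of a dataframe is at its maximum
19. Read a value from a dataframe by row and column position
20. Read a value from a dataframe by index label and column name
21. Rename a column of a dataframe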

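This article collects questions that people frequently run into when using pandas for data analysis. The questions also test how fluent you are with pandas, so they work much like study material: mastering these skills can make your data analysis work far more efficient. For the first part of the exercises, see "pandas Data Analysis: 100 Practice Exercises, Part 1". Part 2 follows: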

How to convert a series of date strings to datetimes

import pandas as pd
ser = pd.Series(['01 Jan 2010', 
                '02-02-2011', 
                 '20120303', 
                 '2013/04/04', 
                 '2014-05-05', 
                 '2015-06-06T12:20'])

pd.to_datetime(ser)

Output (plain):
0 2010-01-01 00:00:00
1 2011-02-02 00:00:00
2 2012-03-03 00:00:00
3 2013-04-04 00:00:00
4 2014-05-05 00:00:00
5 2015-06-06 12:20:00
dtype: datetime64[ns]
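
A hedged side note: pandas 2.0 and later infer a single format from the first element, so a series that mixes formats, like the one above, may raise an error unless `format='mixed'` is passed. When the format is known in advance, stating it explicitly is more predictable. The sketch below uses made-up sample strings (an assumption, not data from this article) and `errors='coerce'` to turn unparseable entries into `NaT`.

```python
import pandas as pd

# Hypothetical sample data: two well-formed dates and one invalid entry
raw = pd.Series(['2014-05-05', '2015-06-06', 'not a date'])

# An explicit format avoids guessing; errors='coerce' yields NaT on failure
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
```

Coercing is convenient for exploration, but it silently hides bad input, so counting the `NaT` values afterwards is usually worthwhile.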

How to extract the year/month/day/hour/minute/second from a datetime series

date = pd.Series(['01 Jan 2010', 
                '02-02-2011', 
                 '20120303', 
                 '2013/04/04', 
                 '2014-05-05', 
                 '2015-06-06T12:20'])
date = pd.to_datetime(date)
date.dt.year

Output (plain):
0 2010
1 2011
2 2012
3 2013
4 2014
5 2015
dtype: int64

date.dt.month

Output (plain):
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64

date.dt.day

Output (plain):
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64

date.dt.hour

Output (plain):
0 0
1 0
2 0
3 0
4 0
5 12
dtype: int64
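
The heading also mentions minutes and seconds, which follow the same `.dt` pattern. The sketch below collects several components into one DataFrame; the two timestamps are hypothetical values chosen so that the minute and second fields are non-zero.

```python
import pandas as pd

# Hypothetical timestamps with non-zero minute and second fields
stamps = pd.to_datetime(pd.Series(['2015-06-06T12:20:45',
                                   '2014-05-05T08:05:09']))

# Collect several components side by side
parts = pd.DataFrame({
    'hour': stamps.dt.hour,
    'minute': stamps.dt.minute,
    'second': stamps.dt.second,
})
print(parts)
```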

Find the words in a series that contain two or more vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

def count(x):
    # Count the vowels in x, ignoring case so that the 'A' in 'Apple' counts
    aims = 'aeiou'
    c = 0
    for i in x.lower():
        if i in aims:
            c += 1
    return c

counts = ser.map(count)
ser[counts >= 2]

Output (plain):
0 Apple
1 Orange
4 Money
dtype: object
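
The vowel count can also be written with the vectorized `Series.str.count`. This is a minimal sketch that assumes a case-insensitive match is wanted, so the capital 'A' in 'Apple' is counted as a vowel.

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Count vowels per word, ignoring case, without a Python-level loop
vowel_counts = ser.str.count('[aeiou]', flags=re.IGNORECASE)
result = ser[vowel_counts >= 2]
print(result.tolist())
```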

How to filter the valid email addresses in a series

emails = pd.Series(['buying books at amazom.com', 
                    'rameses@egypt.com', 
                    'matt@t.co',
                    'narendra@modi.com'])

import re
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
valid = emails.str.findall(pattern, flags=re.IGNORECASE)
[x[0] for x in valid if len(x)]

Output (plain):
['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
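
An alternative sketch uses a boolean mask instead of `findall`. It reuses the same pattern, written as a raw string with an added `$` anchor (my addition), so it keeps only entries that are an email address from start to end, whereas `findall` also extracts addresses embedded in longer text.

```python
import pandas as pd

emails = pd.Series(['buying books at amazom.com',
                    'rameses@egypt.com',
                    'matt@t.co',
                    'narendra@modi.com'])

# Same pattern as above, as a raw string and anchored at the end
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'

# str.match anchors at the start of each string and returns a boolean mask
mask = emails.str.match(pattern)
valid = emails[mask]
print(valid.tolist())
```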

Group series A by series B, then compute the mean of each group

import numpy as np
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

weights.groupby(fruit).mean()

Output (plain):
apple 9.00
banana 4.75
carrot 3.00
dtype: float64
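
Because `np.random.choice` makes the result above differ from run to run, a fixed example is easier to check by hand. The values below are hypothetical: apple has weights 1 and 3, banana has 2 and 5, and carrot has 4.

```python
import pandas as pd

# Hypothetical fixed data so the group means can be verified manually
fruit = pd.Series(['apple', 'banana', 'apple', 'carrot', 'banana'])
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Group the weights by the fruit label at the same position
means = weights.groupby(fruit).mean()
print(means)
```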

How to compute the Euclidean distance between two series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

sum((p - q)**2)**.5

Output (plain):
18.16590212458495
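
The same distance can be obtained with NumPy's `np.linalg.norm`, which computes the square root of the sum of squares by default. The squared differences here sum to 330, so both approaches should give the square root of 330.

```python
import numpy as np
import pandas as pd

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# The default norm of a 1-D array is the Euclidean (L2) norm
dist = np.linalg.norm((p - q).to_numpy())
print(dist)
```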

How to find all local maxima (peaks) in a numeric series

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
peak_locs

Output (plain):
array([1, 5, 7], dtype=int64)
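
The double `np.diff` above marks positions where the slope changes from rising to falling. A more literal sketch of the same idea compares each value with its two neighbours using `shift`; it assumes a peak must be strictly greater than both neighbours, so the first and last elements never qualify.

```python
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# A point is a peak when it is strictly greater than both neighbours;
# comparisons against the NaN that shift creates at the edges are False
is_peak = (ser > ser.shift(1)) & (ser > ser.shift(-1))
peaks = ser[is_peak]
print(peaks.index.tolist())
```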

How to create a TimeSeries of 10 Saturdays starting from '2000-01-02'

pd.Series(np.random.randint(1,10,10), 
          pd.date_range('2000-01-02', 
                        periods=10, 
                        freq='W-SAT'))

Output (plain):
2000-01-08 5
2000-01-15 4
2000-01-22 2
2000-01-29 1
2000-02-05 4
2000-02-12 8
2000-02-19 1
2000-02-26 6
2000-03-04 6
2000-03-11 2
Freq: W-SAT, dtype: int32
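
The index starts on 2000-01-08 rather than 2000-01-02 because 2000-01-02 is a Sunday, and `freq='W-SAT'` moves forward to the next Saturday. A quick sanity check of the generated index might look like this:

```python
import pandas as pd

idx = pd.date_range('2000-01-02', periods=10, freq='W-SAT')

# dayofweek uses Monday=0 ... Saturday=5, Sunday=6
all_saturdays = bool((idx.dayofweek == 5).all())
print(idx[0], len(idx), all_saturdays)
```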

How to fill in the missing dates of a TimeSeries

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01',
                                                       '2000-01-03', 
                                                       '2000-01-06', 
                                                       '2000-01-08']))
# Fill each missing date with the value from the previous date
ser.resample('D').ffill()
# To fill with the value from the next date instead, use the bfill method

Output (plain):
2000-01-01 1.0
2000-01-02 1.0
2000-01-03 10.0
2000-01-04 10.0
2000-01-05 10.0
2000-01-06 3.0
2000-01-07 3.0
2000-01-08 NaN
Freq: D, dtype: float64
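
To make the two steps visible, the sketch below first exposes the gaps with `asfreq('D')` and then fills them backward with `bfill`. It uses a shortened, hypothetical version of the series above without the trailing NaN.

```python
import pandas as pd

ser = pd.Series([1.0, 10.0, 3.0],
                index=pd.to_datetime(['2000-01-01',
                                      '2000-01-03',
                                      '2000-01-06']))

# asfreq inserts the missing days and leaves them as NaN
gaps = ser.asfreq('D')

# bfill takes each missing value from the next known date
back = ser.resample('D').bfill()
print(back)
```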

How to compute the autocorrelation of a series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]

autocorrelations

Output (plain):
[1.0, 0.38, 0.12, 0.17, 0.44, 0.48, 0.25, -0.31, -0.1, 0.65, 0.05]
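
For intuition, the lag-k autocorrelation is the Pearson correlation between the series and a copy of itself shifted by k positions. The sketch below checks this on a small hypothetical series instead of random data.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed series so the comparison is reproducible
ser = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0])
lag = 2

by_method = ser.autocorr(lag)
# corr drops the NaN pairs that shift introduces at the start
by_hand = ser.corr(ser.shift(lag))
print(by_method, by_hand)
```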

Read only every n-th row when reading a csv

# Generate a csv file for testing
fpath = 'testt.csv'
df = pd.DataFrame({'a': range(100), 
                   'b': np.random.choice(['apple', 'banana', 'carrot'], 100)})
df.to_csv(fpath, index=False)

# Read every 20th line of the csv (line 0 is the header)
import csv

with open(fpath, 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i%20 ==0:
            out.append(row)
pd.DataFrame(out[1:], columns=out[0])

Output (html):

	a	b
0	19	banana
1	39	carrot
2	59	banana
3	79	banana
4	99	apple
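
`pd.read_csv` can do the same selection itself: its `skiprows` parameter accepts a callable that receives the file line number and returns True for lines to skip. The sketch below uses a hypothetical in-memory csv as a stand-in for the test file, so it runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical in-memory csv: a header line plus 100 data lines
text = 'a,b\n' + ''.join(f'{i},x{i}\n' for i in range(100))

# Keep file line 0 (the header) and every 20th line after it
sampled = pd.read_csv(io.StringIO(text),
                      skiprows=lambda i: i % 20 != 0)
print(sampled)
```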

Transform data while reading a csv

pd.read_csv(fpath, 
            converters={
                'a':lambda x: 'low' if int(x) < 50 else 'high'
            }).head()

Output (html):

	a	b
0	low	carrot
1	low	carrot
2	low	banana
3	low	apple
4	low	apple
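
Converters receive each cell as a raw string, which is why the lambda above calls `int(x)` before comparing. The sketch below applies the same converter to a hypothetical in-memory csv holding 0 through 99 and counts the resulting labels as a sanity check.

```python
import io
import pandas as pd

# Hypothetical in-memory csv with a single column holding 0..99
text = 'a\n' + ''.join(f'{i}\n' for i in range(100))

labelled = pd.read_csv(
    io.StringIO(text),
    converters={'a': lambda x: 'low' if int(x) < 50 else 'high'},
)

# Values 0..49 become 'low' and 50..99 become 'high'
counts = labelled['a'].value_counts()
print(counts)
```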

Read only certain columns when reading a csv

pd.read_csv(fpath, usecols=['a']).head()

Output (html):

	a
0	0
1	1
2	2
3	3
4	4
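
`usecols` also accepts several names at once, and columns that are not listed are not loaded. A small sketch with a hypothetical three-column csv:

```python
import io
import pandas as pd

# Hypothetical csv with three columns
text = 'a,b,c\n1,x,0.5\n2,y,1.5\n'

# Only columns 'a' and 'c' are parsed; 'b' is skipped entirely
subset = pd.read_csv(io.StringIO(text), usecols=['a', 'c'])
print(subset)
```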

Get the data type of each column of a dataframe

df=pd.DataFrame(
    {
        'a':range(100),
        'b':np.random.rand(100),
        'c':[1,2,3,4]*25,
        'd':['apple', 'banana', 'carrot']*33 + ['apple']
    }
)

df.dtypes

Output (plain):
a int64
b float64
c int64
d object
dtype: object
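
Once the dtypes are known, `select_dtypes` can pick columns by type. The sketch below uses a small hypothetical frame and keeps only the numeric columns.

```python
import pandas as pd

# Hypothetical frame with integer, float, and string columns
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [0.1, 0.2, 0.3],
    'd': ['apple', 'banana', 'carrot'],
})

# 'number' covers both integer and floating-point columns
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())
```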

Get the number of rows and columns of a dataframe

df.shape

Output (plain):
(100, 4)

获取dataframe每列的基本描述统计

1 2	df.describe()

输出(html):

	a	b	c
count	100.000000	100.000000	100.000000
mean	49.500000	0.515885	2.500000
std	29.011492	0.281679	1.123666
min	0.000000	0.000605	1.000000
25%	24.750000	0.280289	1.750000
50%	49.500000	0.545348	2.500000
75%	74.250000	0.736113	3.250000
max	99.000000	0.992075	4.000000
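
By default `describe()` reports only the numeric columns, which is why column `d` is absent from the table above. Calling it on a text column gives a different summary (count, unique, top, freq). A sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: six rows, three distinct fruit names
df = pd.DataFrame({'a': range(6),
                   'd': ['apple', 'banana', 'carrot'] * 2})

# For text data, describe reports count, unique, top and freq
summary = df['d'].describe()
print(summary)
```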

Find the row of a dataframe where column a is at its maximum

df.loc[df.a==np.max(df.a)]

Output (html):

	a	b	c	d
99	99	0.598169	4	apple
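
When a single row is enough, `idxmax` is a common alternative: it returns the index label of the first maximum, which can then be passed to `.loc`. Unlike the boolean mask above, it yields only one row even when the maximum is tied. The frame below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 9, 5],
                   'd': ['apple', 'banana', 'carrot']})

# Index label of the first row where 'a' is largest
top_label = df['a'].idxmax()
top_row = df.loc[top_label]
print(top_label, top_row['d'])
```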

Get the row numbers where column c of a dataframe is at its maximum

np.where(df.c==np.max(df.c))

Output (plain):
(array([ 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67,
71, 75, 79, 83, 87, 91, 95, 99], dtype=int64),)
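
`np.where` with a single argument returns a tuple of arrays, hence the trailing `,)` in the output above. If a flat array of positions is preferred, `np.flatnonzero` is one option; a sketch on a shorter hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c': [1, 2, 3, 4] * 3})

# Boolean mask of rows at the maximum, converted to integer positions
mask = (df['c'] == df['c'].max()).to_numpy()
positions = np.flatnonzero(mask)
print(positions.tolist())
```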

Read a value from a dataframe by row and column position

# Look up several (row, column) positions with the integer-based .iat accessor
for row, col in [(4, 0), (4, 2), (0, 0), (33, 3)]:
    print(f'The value at row {row}, column {col} is: {df.iat[row, col]}')

Output (stream):
The value at row 4, column 0 is: 4
The value at row 4, column 2 is: 1
The value at row 0, column 0 is: 0
The value at row 33, column 3 is: apple

Read a value from a dataframe by index label and column name

index = 0
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 2
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 4
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 5
col = 'c'
print(f'index={index}, col={col} : {df.at[index, col]}')

Output (stream):
index=0, col=d : apple
index=2, col=d : carrot
index=4, col=d : banana
index=5, col=c : 2
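
With the default integer index, positions and labels coincide, so the difference between `.iat` and `.at` is easy to miss. The sketch below uses a hypothetical frame with string labels to show the same cell reached both ways.

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 2, 3],
                   'd': ['apple', 'banana', 'carrot']},
                  index=['x', 'y', 'z'])

by_position = df.iat[1, 1]   # second row, second column (0-based)
by_label = df.at['y', 'd']   # index label 'y', column name 'd'
print(by_position, by_label)
```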

Rename a column of a dataframe

df.rename(columns={'d':'fruit'}).head()

Output (html):

	a	b	c	fruit
0	0	0.406456	1	apple
1	1	0.607407	2	banana
2	2	0.197953	3	carrot
3	3	0.279180	4	apple
4	4	0.193107	1	banana
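
Note that `rename` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to keep the new name. A sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 2], 'd': ['apple', 'banana']})

# rename returns a copy; df itself still has column 'd'
renamed = df.rename(columns={'d': 'fruit'})
print(df.columns.tolist(), renamed.columns.tolist())
```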

That's all for today's tutorial. Follow my site mlln.cn for more pandas practice exercises to come. I hope this work helps you learn pandas or handle questions in a job interview.

Note
This article was converted from a Jupyter notebook, and the notebook is available for download.

#python #pandas
